Abstract. Simple stochastic games are two-player zero-sum stochastic games with
turn-based moves, perfect information, and reachability winning conditions.

We present two new algorithms computing the values of simple stochastic games. Both
of them rely on the existence of optimal permutation strategies, a class of positional
strategies derived from permutations of the random vertices. The "permutation-enumeration"
algorithm performs an exhaustive search among these strategies, while the
"permutation-improvement" algorithm is based on successive improvements, à la Hoffman-Karp.

Our algorithms improve on previously known algorithms in several respects. First, they run
in polynomial time when the number of random vertices is fixed, so the problem of solving
simple stochastic games is fixed-parameter tractable when the parameter is the number
of random vertices. Furthermore, our algorithms do not require the input game to be
transformed into a stopping game. Finally, the permutation-enumeration algorithm does
not use linear programming, while the permutation-improvement algorithm may run in
polynomial time.



Introduction 

Simple stochastic games (SSGs) are played by two players called Max and Min in a
sequence of steps. The players move a pebble along the edges of a directed graph $(V, E)$
whose vertices are partitioned into three sets: $V_{\mathrm{Max}}$, $V_{\mathrm{Min}}$, and $V_R$. When the
pebble is on a vertex of $V_{\mathrm{Max}}$ or $V_{\mathrm{Min}}$, the corresponding player chooses an outgoing
edge and moves the pebble along it. When the pebble is on a vertex of $V_R$ (a random
vertex), the outgoing edge is chosen randomly according to a fixed probability distribution.
The players have opposite goals, as Max wants to reach a special sink vertex © while Min
wants to avoid it forever. An example of an SSG is depicted in Figure 1, with vertices of
$V_{\mathrm{Max}}$ represented as ○'s, vertices of $V_{\mathrm{Min}}$ represented as □'s, and vertices of $V_R$
represented as △'s.

SSGs are a natural model of reactive systems. Consider, for example, a hardware
component. It can be modelled as an SSG whose vertices represent the global states of
the component and whose target is some error state to avoid. The nature of a given vertex
depends on who can influence the immediate evolution of the system: it is a Min vertex if the
software can choose between different options, a Max vertex if there is a (non-deterministic)
input requested from the user, and a random vertex if the evolution depends on a stochastic
environment. An optimal strategy for Min can then be used as the basis for the synthesis
of a "good" driver, i.e. one which minimises the probability of entering the error state
independently of the behaviour of the user.

Figure 1: A Simple Stochastic Game.

The main algorithmic problem about SSGs is the computation of the values of the vertices
and of optimal strategies for the players. This problem was first addressed by Condon, who
showed that deciding whether the value of a vertex is greater than $\frac{1}{2}$ belongs to NP and
co-NP [Con92]. Condon's algorithm non-deterministically guesses the values of the vertices,
which are rational numbers of linear size, and checks that they are solutions of some local
optimality equations. This algorithm is correct only for stopping games, where the pebble
reaches either the target or a sink vertex with probability one, regardless of the players'
strategies. Any SSG can be transformed in polynomial time into a stopping SSG with
(almost) the same values, but this incurs a quadratic blow-up of the size of the game.

Three other algorithms for solving SSGs are presented in [Con93]. The first one computes
the values of the vertices using a quadratic program with linear constraints. The
second one iteratively computes the values of the vertices from below, and the third is a
strategy improvement algorithm à la Hoffman-Karp [HK66]. The latter two algorithms,
like the ones recently proposed in [Som05], solve a series of linear programs which could be
of exponential length. Furthermore, solving a linear program requires high-precision
arithmetic, even if it can be done in polynomial time [Kha79, Ren88]. The best randomised
algorithms achieve sub-exponential expected time $e^{O(\sqrt{n})}$ [Lud95, Hal07].

In this paper we present two algorithms computing the values and optimal strategies
in SSGs: the "permutation-enumeration" and the "permutation-improvement" algorithms.
The common basis for both algorithms is that optimal strategies can be looked for in a
subset of the positional strategies called permutation strategies. Permutation strategies are
derived from permutations over the random vertices. In order to find optimal strategies, the
permutation-enumeration algorithm performs an exhaustive search among all permutation
strategies, whereas the permutation-improvement algorithm performs successive
improvements of permutation strategies, à la Hoffman-Karp [HK66].

The permutation-enumeration and the permutation-improvement algorithms share two
advantages over existing algorithms. First, they perform much better on SSGs with few
random vertices, as they run in polynomial time when the number of random vertices is
logarithmic in the size of the game: it follows that the problem of solving SSGs is
fixed-parameter tractable when the parameter is the number of random vertices. Second, they
do not rely on the transformation of the input SSG into a stopping SSG, which avoids
the quadratic blow-up of the size of the game. Moreover, the permutation-enumeration
algorithm does not use linear or quadratic programming (it just computes the solutions
to linear systems) and its worst-case complexity is $O(|V_R|! \cdot (|E| + |\delta|))$, where $|V_R|$ is
the number of random vertices, $|E|$ is the number of edges and $|\delta|$ is the maximal
bit-length of transition probabilities. The nominal complexity of the permutation-improvement
algorithm is higher but we do not know any non-trivial lower bound for its complexity: the
permutation-improvement algorithm may actually run in polynomial time.

Outline. In Section 1 we provide formal definitions for SSGs, values and optimal strategies.
We then describe in Section 2 the central notion of permutation strategies. Section 3
presents the permutation-enumeration algorithm, based on the self-consistency and liveness
properties. Section 4 introduces an improvement policy for permutations which leads to the
permutation-improvement algorithm.

1. Simple Stochastic Games 

1.1. Plays and strategies. A simple stochastic game is a tuple
$(V, V_{\mathrm{Max}}, V_{\mathrm{Min}}, V_R, E, \delta, ©)$, where $(V, E)$ is a graph,
$(V_{\mathrm{Max}}, V_{\mathrm{Min}}, V_R)$ is a partition of $V$, and © is a distinguished sink
vertex in $V$ called the target of the game. The transitions from the random vertices are
equipped with probabilities described by the function $\delta : V_R \to V \to [0, 1]$, such that for all
$v \in V_R$ and $w \in V$, $\delta(v)(w) > 0 \Rightarrow (v, w) \in E$, and $\sum_{w \in V} \delta(v)(w) = 1$.
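
The two conditions on $\delta$ are easy to check mechanically. The following is a minimal Python sketch, not part of the original presentation: the representation (a set of edge pairs and a dictionary of distributions) and the name `check_delta` are assumptions made for illustration, and exact rationals are used so that the sum-to-one test is not disturbed by rounding.

```python
from fractions import Fraction

def check_delta(random_vertices, edges, delta):
    """Check the two conditions on delta stated above (illustrative helper).

    edges: set of pairs (v, w); delta: dict v -> dict w -> probability.
    """
    for v in random_vertices:
        dist = delta.get(v, {})
        # Positive probability is only allowed along edges of the graph.
        if any(p < 0 or (p > 0 and (v, w) not in edges) for w, p in dist.items()):
            return False
        # The probabilities leaving v must sum to one.
        if sum(dist.values(), Fraction(0)) != 1:
            return False
    return True

# Hypothetical toy data: one random vertex "a" with two outgoing edges.
edges = {("a", "t"), ("a", "s")}
good = {"a": {"t": Fraction(1, 3), "s": Fraction(2, 3)}}
bad = {"a": {"t": Fraction(1, 3), "s": Fraction(1, 3)}}
```

Here `good` satisfies both conditions, while `bad` fails because its probabilities only sum to 2/3.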

An infinite play $\rho$ is an infinite sequence $\rho_0 \rho_1 \cdots \in V^\omega$ of vertices such that for all
$i \in \mathbb{N}$, $(\rho_i, \rho_{i+1}) \in E$. It is winning for Max if there is an $i \in \mathbb{N}$ such that $\rho_i = ©$ (as © is
a sink, it follows that $\forall j \geq i$, $\rho_j = ©$). Otherwise, $\rho$ is winning for Min. A finite play is a
finite prefix of an infinite play.

A (pure) strategy for Max is a mapping $\sigma : V^* V_{\mathrm{Max}} \to V$ such that for each finite play
$h = h_0 \ldots h_i$ ending in a Max vertex, $(h_i, \sigma(h)) \in E$. It is positional if it only depends
on the last vertex of $h$: $\sigma(h) = \sigma(h_i)$. A play $\rho_0 \rho_1 \ldots$ is consistent with $\sigma$ if for every $i$
such that $\rho_i \in V_{\mathrm{Max}}$, $\rho_{i+1} = \sigma(\rho_0 \cdots \rho_i)$. Strategies for Min are defined analogously and are
generally denoted by $\tau$.

1.2. Measures and values. The set of plays is made into a measurable space by equipping it with the
$\sigma$-algebra generated by the canonical projections $\{V_i\}_{i \in \mathbb{N}}$, where $V_i(\rho_0 \rho_1 \ldots) = \rho_i$ [Bil95].
Once an initial vertex $v$ and two strategies $\sigma$ and $\tau$ for players Max and Min have been
fixed, the probability measure $\mathbb{P}_v^{\sigma,\tau}$ is defined by:

$$\mathbb{P}_v^{\sigma,\tau}(V_0 = v) = 1 ,$$
$$\mathbb{P}_v^{\sigma,\tau}(V_{i+1} = \sigma(V_0 \cdots V_i) \mid V_i \in V_{\mathrm{Max}}) = 1 ,$$
$$\mathbb{P}_v^{\sigma,\tau}(V_{i+1} = \tau(V_0 \cdots V_i) \mid V_i \in V_{\mathrm{Min}}) = 1 ,$$
$$\mathbb{P}_v^{\sigma,\tau}(V_{i+1} \mid V_i \in V_R) = \delta(V_i)(V_{i+1}) .$$

The expectation of a real-valued, measurable and bounded function $\varphi$ under $\mathbb{P}_v^{\sigma,\tau}$ is denoted
$\mathbb{E}_v^{\sigma,\tau}[\varphi]$. We will often use implicitly the following formulae, which govern the probabilities and
expectations once a finite prefix $h = h_0 \cdots h_i$ is fixed:

$$\mathbb{P}_v^{\sigma,\tau}(T \mid V_0 \cdots V_i = h_0 \cdots h_i) = \mathbb{P}_{h_i}^{\sigma[h],\tau[h]}(T[h]) , \tag{1.1}$$

$$\mathbb{E}_v^{\sigma,\tau}[\varphi \mid V_0 \cdots V_i = h_0 \cdots h_i] = \mathbb{E}_{h_i}^{\sigma[h],\tau[h]}[\varphi[h]] , \tag{1.2}$$

where $\sigma[h](\rho_0 \rho_1 \cdots) = \sigma(h_0 \ldots h_{i-1} \rho_0 \rho_1 \cdots)$, and $\tau[h]$, $T[h]$, and $\varphi[h]$ are defined
analogously.

If we fix only Max's strategy $\sigma$ and the initial vertex $v$, the target vertex will be reached
with probability at least:

$$\inf_\tau \mathbb{P}_v^{\sigma,\tau}(\mathrm{Reach}(©)) ,$$

where $\mathrm{Reach}(©)$ is the event $\{\exists i \in \mathbb{N}, V_i = ©\}$. Starting from $v$, player Max has strategies
that guarantee a winning outcome with a probability greater than:

$$\mathrm{val}_*(v) = \sup_\sigma \inf_\tau \mathbb{P}_v^{\sigma,\tau}(\mathrm{Reach}(©)) ,$$

minus $\varepsilon$ for any $\varepsilon > 0$. Symmetrically, Min has strategies that guarantee a winning outcome
with a probability less than:

$$\mathrm{val}^*(v) = \inf_\tau \sup_\sigma \mathbb{P}_v^{\sigma,\tau}(\mathrm{Reach}(©)) ,$$

plus $\varepsilon$ for any $\varepsilon > 0$. It is clear that $\mathrm{val}_*(v) \leq \mathrm{val}^*(v)$. In the case of SSGs, stronger results
are known:

Theorem 1.1 ([Sha53, Gil57, LL69]). Let $G = (V, V_{\mathrm{Max}}, V_{\mathrm{Min}}, V_R, E, \delta, ©)$ be an SSG. Then,
for any vertex $v \in V$,

$$\mathrm{val}_*(v) = \mathrm{val}^*(v) .$$

This common value is denoted by $\mathrm{val}(v)$. Furthermore, there are positional optimal strategies
for both players, i.e. positional strategies $\sigma^\#$ and $\tau^\#$ such that, for any strategies $\sigma$ and $\tau$:

$$\mathbb{P}_v^{\sigma,\tau^\#}(\mathrm{Reach}(©)) \leq \mathrm{val}(v) \leq \mathbb{P}_v^{\sigma^\#,\tau}(\mathrm{Reach}(©)) .$$

1.3. Normalised games. An SSG is normalised if the only vertex with value 1 is the target
© and there is only one (sink) vertex ⊗ with value 0. Our motivations for the introduction
of this notion are twofold. First, several proofs are much simpler for normalised games.
Second, any SSG can be reduced to an equivalent normalised game in linear time and the
resulting game is smaller than the original one. This reduction is presented in Figure 2: it
simply consists in merging the region with value one into © and the region with value zero
into a new sink vertex ⊗.

In the remainder of this article, we assume that we are working on a normalised SSG
$G = (V, V_{\mathrm{Max}}, V_{\mathrm{Min}}, V_R, E, \delta, ©, ⊗)$, with $k$ random vertices.

Figure 2: Normalisation.



2. Permutation strategies 

The existence of positional optimal strategies is a key property of SSGs and the
cornerstone of many algorithms solving these games. The algorithms we propose rely on a
refinement of this result: optimal strategies can be looked for among a subset of the
positional strategies, the set of "permutation strategies".

As a matter of fact, Theorem 1.1 is a corollary of results of the present paper. The
proofs of our results often rely on the existence of values and optimal (not only $\varepsilon$-optimal)
strategies in SSGs. This could be avoided (the main point is to use $\mathrm{val}_*$ instead of $\mathrm{val}$),
but we felt that it was not worth the extra complexity.

The main intuition underlying permutation strategies is that the only meaningful events
in a play are the visits to random vertices. Between two visits the players only strive to
impose which random vertex will be visited next, and the result of their interaction can be
easily predicted. This is illustrated by Figure 3, which zooms in on two details of Figure 1.




Figure 3: Coherence and contention. 

In the left part of Figure 3, Max must choose between the two random vertices b and
c (refusing to choose is not really an option). There is no reason to choose b in one of the
vertices, and c in the other. We could consider only the strategies "always go to b" and
"always go to c". In the right part of Figure 3 we consider relationships between the two
players' strategies. From their respective vertices ○ and □, Max and Min can send the
pebble to either a or b. We could restrict our attention to the cases where Max goes to one,
and Min to the other.

Underlying these intuitions is the idea of a "preference order" over the random vertices.
In the remainder of this article, we formalise it as a permutation: a one-to-one
correspondence $f$ between $V_R$ and $\{1, \ldots, k\}$. For simplicity, we often write $f_i$ instead of $f^{-1}(i)$ and
we consider the sink and target vertices as random vertices with the implicit assumption
that they are respectively the lowest and greatest vertices: $f_0 = ⊗$ and $f_{k+1} = ©$.

2.1. Attractors and f-regions. Once a permutation $f : V_R \to \{1, \ldots, k\}$ has been fixed,
the f-strategies consist in Max trying to reach the highest (with respect to $f$) possible
random vertex, while Min tries to thwart her. Notice that the situation is not exactly
symmetric, since the burden of reaching a random vertex lies with Max: in case the pebble
remains forever in controlled vertices, player Min wins. The formal definition of
permutation strategies is based on the notion of deterministic attractor.

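Definition 2.1. Let $X \subseteq V$ be a set of vertices. The deterministic attractor of Max to $X$,
denoted by $\mathrm{DetAtt}(X)$, is computed recursively:

$$X^0 = X ,$$
$$X^{i+1} = X^i \cup \{v \in V_{\mathrm{Max}} \mid \exists w \in X^i, (v, w) \in E\} \cup \{v \in V_{\mathrm{Min}} \mid \forall w \in V, (v, w) \in E \Rightarrow w \in X^i\} ,$$
$$\mathrm{DetAtt}(X) = \bigcup_{i \geq 0} X^i .$$

An attracting strategy to $X$ for Max is a positional strategy $\sigma$ such that:

$$\forall i \geq 1, \ \sigma(X^i) \subseteq X^{i-1} .$$

Symmetrically, a trapping strategy out of $X$ for Min is a positional strategy $\tau$ such that:

$$\tau(V \setminus \mathrm{DetAtt}(X)) \subseteq V \setminus \mathrm{DetAtt}(X) .$$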
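
The recursion of Definition 2.1 can be sketched as a plain fixpoint computation. The Python fragment below is illustrative only: the adjacency-dictionary representation and the name `det_attractor` are assumptions, every Max and Min vertex is assumed to have at least one outgoing edge, and the naive loop does not attempt to match the near-linear running time quoted later for [AHMS08].

```python
def det_attractor(target, succ, max_vertices, min_vertices):
    """Deterministic attractor of Max to `target` by naive fixpoint iteration.

    succ maps each Max or Min vertex to the set of its successors.
    Random vertices outside `target` are never added.
    """
    attr = set(target)
    changed = True
    while changed:
        changed = False
        for v in max_vertices:
            # Max needs only one edge into the attractor.
            if v not in attr and any(w in attr for w in succ[v]):
                attr.add(v)
                changed = True
        for v in min_vertices:
            # Min is caught only if all of its edges lead into the attractor.
            if v not in attr and all(w in attr for w in succ[v]):
                attr.add(v)
                changed = True
    return attr

# Hypothetical toy graph: "x" belongs to Max, "y" and "z" to Min; "a", "b" are random.
succ = {"x": {"a", "y"}, "y": {"a", "b"}, "z": {"a"}}
```

On this toy graph, the attractor to {a} contains x and z but not y, since Min can escape from y to b.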

The f-regions associated with a permutation $f : V_R \to \{1, \ldots, k\}$ are defined as
embedded deterministic attractors to the random vertices:

$$W_f[k+1] = \{©\} ,$$
$$\forall 1 \leq i \leq k, \ W_f[i] = \mathrm{DetAtt}(\{f_i, \ldots, f_k, ©\}) \setminus \bigcup_{j > i} W_f[j] ,$$
$$W_f[0] = \{⊗\} .$$
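
The nested attractors defining the f-regions can be computed from the highest index downwards. The following Python sketch rests on assumptions of ours: the function names, the dictionary-based graph, and a normalised game, so that the sink is the only vertex left in region 0.

```python
def det_attractor(target, succ, max_vertices, min_vertices):
    """Naive fixpoint computation of the deterministic attractor of Max."""
    attr = set(target)
    changed = True
    while changed:
        changed = False
        for v in max_vertices:
            if v not in attr and any(w in attr for w in succ[v]):
                attr.add(v)
                changed = True
        for v in min_vertices:
            if v not in attr and all(w in attr for w in succ[v]):
                attr.add(v)
                changed = True
    return attr

def f_regions(perm, target, sink, succ, max_vertices, min_vertices):
    """Map each vertex to the index of its f-region.

    perm lists the random vertices as (f_1, ..., f_k), least preferred first.
    """
    k = len(perm)
    region = {target: k + 1}
    goal = {target}
    for i in range(k, 0, -1):
        goal.add(perm[i - 1])  # goal is now {f_i, ..., f_k, target}
        for v in det_attractor(goal, succ, max_vertices, min_vertices):
            region.setdefault(v, i)  # keep the higher index if already assigned
    region.setdefault(sink, 0)
    return region

# Hypothetical toy game: Max vertex "x", Min vertex "y", random "a" and "b".
succ = {"x": {"a", "y"}, "y": {"a", "b"}}
regions_ab = f_regions(("a", "b"), "t", "s", succ, {"x"}, {"y"})
regions_ba = f_regions(("b", "a"), "t", "s", succ, {"x"}, {"y"})
```

With the order (a, b), neither player-controlled vertex can be forced to b, so both fall into the region of a; with the order (b, a), Max reaches a directly from x while Min escapes from y to b.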

2.2. Permutation strategies. The f-strategies $\sigma_f$ and $\tau_f$ are strategies such that, on each
$W_f[i]$:

• $\sigma_f$ coincides with an attracting strategy to $\{f_i, \ldots, f_k, ©\}$,

• $\tau_f$ coincides with a trapping strategy out of $\{f_{i+1}, \ldots, f_k, ©\}$.

The f-regions partition $V$, so we extend the definition domain of $f : V_R \to \{1, \ldots, k\}$
to $V$ in a natural way: $f(v) = i$ if $v \in W_f[i]$. The following properties are easy to prove:

$$\forall v \in V_{\mathrm{Max}}, \ f(v) = f(\sigma_f(v)) , \tag{2.1}$$
$$\forall v \in V_{\mathrm{Max}}, \ \forall (v, w) \in E, \ f(v) \geq f(w) , \tag{2.2}$$
$$\forall v \in V_{\mathrm{Min}}, \ f(v) = f(\tau_f(v)) , \tag{2.3}$$
$$\forall v \in V_{\mathrm{Min}}, \ \forall (v, w) \in E, \ f(v) \leq f(w) . \tag{2.4}$$

If Max plays $\sigma_f$ and Min plays $\tau_f$ from an initial vertex $v$, the first random vertex reached
by the pebble is the unique random vertex $w$ such that $f(w) = f(v)$. Figure 4 describes the
f-regions and f-strategies of the game of Figure 1 for $f = abcd$.

Figure 4: f-regions and f-strategies in the game of Figure 1.



2.3. The f-values. When both players use their respective permutation strategies, the
probability that a pebble starting in $v$ reaches © is denoted by $\varphi_f(v)$:

$$\varphi_f(v) = \mathbb{P}_v^{\sigma_f,\tau_f}(\mathrm{Reach}(©)) .$$

Proposition 2.2. Let $f$ be a permutation. The f-regions and the f-strategies can be
computed in time $O(|E| \log^*(|V|))$ and the f-values can be computed in time $O(|V_R|^3 |\delta|)$.

Proof. The f-regions and f-strategies can be expressed in terms of deterministic games as
they do not depend on what happens once a random vertex is reached. We can thus use
the results of [AHMS08] to compute them in time $O(|E| \log^*(|V|))$. In order to compute
the f-values, we build a Markov chain $\mathcal{M}_f$ designed to mimic the behaviour of $G$ when the
players use their f-strategies. Intuitively, we merge each region $W_f[i]$ into a single vertex $i$;
formally, $\mathcal{M}_f$ is a Markov chain with states $S = \{0, \ldots, k+1\}$ such that $0$ and $k+1$ are
absorbing and, for every $1 \leq i \leq k$ and $0 \leq j \leq k+1$, the transition probability from $i$ to
$j$ is given by:

$$p_{i,j} = \sum_{v \in W_f[j]} \delta(f_i)(v) .$$

The values $x^* : \{0, \ldots, k+1\} \to [0, 1]$ of $\mathcal{M}_f$ are computed as follows. Let $I \subseteq S$ be
the set of states from which $k+1$ is reachable in $\mathcal{M}_f$. Then, for each $i \notin I$, $x^*_i = 0$, and
$(x^*_i)_{i \in I}$ is the unique solution of the following linear system:

$$x^*_{k+1} = 1 , \qquad x^*_i = \sum_{j \in I} p_{i,j} \cdot x^*_j \quad \text{for } i \in I \setminus \{k+1\} ,$$

which can be solved in time $O(|V_R|^3 |\delta|)$ [Dix82]. For each $v \in V$, $\varphi_f(v) = x^*_{f(v)}$. □
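
The construction in this proof can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and data layout are ours, the transition probability from $i$ to $j$ is taken to be the total mass that $\delta(f_i)$ puts on the region $W_f[j]$, and plain Gauss-Jordan elimination over exact rationals stands in for the method of [Dix82].

```python
from fractions import Fraction

def transition_matrix(perm, region, delta):
    """p[i][j]: probability that random vertex f_i moves into region j."""
    k = len(perm)
    p = [[Fraction(0)] * (k + 2) for _ in range(k + 2)]
    for i, v in enumerate(perm, start=1):
        for w, prob in delta[v].items():
            p[i][region[w]] += Fraction(prob)
    return p

def chain_values(p):
    """Probability of absorption in state k+1 from each state 0..k+1."""
    top = len(p) - 1
    live = {top}  # states from which k+1 is reachable
    changed = True
    while changed:
        changed = False
        for i in range(1, top):
            if i not in live and any(p[i][j] > 0 for j in live):
                live.add(i)
                changed = True
    unknown = sorted(live - {top})
    idx = {s: r for r, s in enumerate(unknown)}
    m = len(unknown)
    # Augmented system (I - P) x = b restricted to the live transient states.
    a = [[Fraction(0)] * (m + 1) for _ in range(m)]
    for s in unknown:
        a[idx[s]][idx[s]] += 1
        for t in unknown:
            a[idx[s]][idx[t]] -= Fraction(p[s][t])
        a[idx[s]][m] = Fraction(p[s][top])
    for c in range(m):  # Gauss-Jordan elimination with exact arithmetic
        piv = next(r for r in range(c, m) if a[r][c] != 0)
        a[c], a[piv] = a[piv], a[c]
        inv = 1 / a[c][c]
        a[c] = [x * inv for x in a[c]]
        for r in range(m):
            if r != c and a[r][c] != 0:
                factor = a[r][c]
                a[r] = [x - factor * y for x, y in zip(a[r], a[c])]
    values = [Fraction(0)] * (top + 1)
    values[top] = Fraction(1)
    for s in unknown:
        values[s] = a[idx[s]][m]
    return values

# Hypothetical toy chain with k = 2 random vertices "a" and "b".
half = Fraction(1, 2)
region = {"s": 0, "a": 1, "b": 2, "t": 3}
delta = {"a": {"b": half, "s": half}, "b": {"a": half, "t": half}}
values = chain_values(transition_matrix(("a", "b"), region, delta))
```

Restricting the system to the states in $I$ is what makes the solution unique: a state that cannot reach $k+1$, such as one looping on itself, is given value 0 without entering the elimination.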



3. The permutation-enumeration algorithm 

In this section we describe the permutation-enumeration algorithm which computes
optimal strategies for both players. This algorithm relies on the following key property of
permutation strategies.

Theorem 3.1. In every SSG, there exists a permutation $f$ such that $\sigma_f$ is optimal for Max
and $\tau_f$ is optimal for Min.

This theorem suggests a very simple enumerative algorithm computing values and
optimal strategies: check for each permutation $f$ whether the f-strategies are optimal. Each
test can be performed in polynomial time using linear programming [Der72, Con92].
However, linear programming requires high-precision arithmetic and is expensive in practice.
Our permutation-enumeration algorithm uses a simpler criterion based on a refinement of
Theorem 3.1: we look for permutations which are live and self-consistent.

3.1. Liveness and self-consistency. The permutation-enumeration algorithm is based
on two simple properties of permutations: self-consistency and liveness. Self-consistency
expresses the agreement between a priori preferences (the permutation $f$) and the resulting values
(the f-values $\varphi_f$). Liveness stipulates that each random vertex has a positive probability of
immediately leading to a better region (from Max's point of view).

Definition 3.2. A permutation $f$ is self-consistent if:

$$\varphi_f(f_1) \leq \varphi_f(f_2) \leq \cdots \leq \varphi_f(f_k) .$$

Definition 3.3. A permutation $f$ is live if:

$$\forall 1 \leq i \leq k, \ \exists j > i, \ \exists v \in W_f[j], \ \delta(f_i)(v) > 0 .$$

As we show below, the f-strategies associated with a live and self-consistent
permutation $f$ are optimal and there is always such a permutation. The permutation-enumeration
algorithm performs an exhaustive search for a live and self-consistent permutation.
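
Both tests are direct to implement once the f-regions and f-values are available. The sketch below is a hedged Python illustration: the names `is_self_consistent` and `is_live`, and the dictionaries `region`, `delta` and `phi`, are assumptions about how these objects might be stored.

```python
from fractions import Fraction

def is_self_consistent(perm, phi):
    """Definition 3.2: the f-values are non-decreasing along (f_1, ..., f_k)."""
    vals = [phi[v] for v in perm]
    return all(x <= y for x, y in zip(vals, vals[1:]))

def is_live(perm, region, delta):
    """Definition 3.3: each f_i moves to a strictly higher region with positive probability."""
    for i, v in enumerate(perm, start=1):
        if not any(p > 0 and region[w] > i for w, p in delta[v].items()):
            return False
    return True

# Hypothetical toy data with k = 2, target "t" (region 3) and sink "s" (region 0).
half = Fraction(1, 2)
region = {"s": 0, "a": 1, "b": 2, "t": 3}
delta = {"a": {"b": half, "s": half}, "b": {"a": half, "t": half}}
phi = {"a": Fraction(1, 3), "b": Fraction(2, 3)}
stuck = {"a": {"a": half, "s": half}, "b": {"a": half, "t": half}}
```

In the `stuck` variant, vertex a can only stay in its own region or fall to the sink, so the permutation (a, b) is not live there.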



Input: A normalised simple stochastic game $G = (V, V_{\mathrm{Max}}, V_{\mathrm{Min}}, V_R, E, \delta, ©, ⊗)$.
Output: Optimal strategies for Max and Min.

forall permutations f over $V_R$ do
    compute the f-regions;
    compute the f-values;
    if f is live and self-consistent then
        return $\sigma_f$ and $\tau_f$;

Algorithm 1: The permutation-enumeration algorithm.
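
The control flow of Algorithm 1 can be sketched as follows. This Python skeleton is only an illustration: `analyse` is a hypothetical callback assumed to compute the f-regions and f-values of a permutation and to report whether it is live and self-consistent, together with the corresponding strategies.

```python
from itertools import permutations

def permutation_enumeration(random_vertices, analyse):
    """Exhaustive search for a live and self-consistent permutation.

    analyse(perm) is assumed to return (live, self_consistent, strategies).
    """
    for perm in permutations(random_vertices):
        live, consistent, strategies = analyse(perm)
        if live and consistent:
            return perm, strategies
    return None  # not expected on a normalised game, where such a permutation exists

def stub_analyse(perm):
    """Stand-in for the real analysis: accepts exactly one permutation."""
    ok = perm == ("b", "a", "c")
    return ok, ok, "strategies for " + "".join(perm)

result = permutation_enumeration(["a", "b", "c"], stub_analyse)
```

At most $k!$ permutations are examined, which is where the factorial term of the running time comes from.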
Theorem 3.4. The permutation-enumeration algorithm terminates and returns optimal
strategies for Max and Min. Its worst-case running time is $O(|V_R|! \cdot (|E| + |\delta|))$.

Proof. Correctness and termination are proved in Lemmas 3.7 and 3.10 respectively. The
worst-case complexity follows from the fact that there are at most $k!$ permutations and
Proposition 2.2. □

Before we proceed with the proofs of the main lemmas, let us make a case for
liveness: Figure 5 shows that self-consistency is not enough to guarantee the optimality of the
resulting strategies.¹

Figure 5: Self-consistency does not guarantee optimality.

In this excerpt from the game of Figure 1, Max's strategy in ○ should be to send
the pebble to b, as Min could otherwise trap the play in $\{a, ○, □\}$. However, consider the
permutation $g = bcad$: Min sends the pebble from □ to c to avoid a; Max sends the pebble
from ○ to □ to reach either a or c. We have thus $\varphi_g(a) = \varphi_g(c)$. As a matter of fact, we
have $\varphi_g(b) \leq \varphi_g(a) = \varphi_g(c) \leq \varphi_g(d)$, so $g$ is self-consistent even though the g-values are
not the correct ones. Liveness forbids this kind of gambit from Max. It replaces, in this
respect, the "stopping" hypothesis of Condon.

3.2. Correctness of the permutation-enumeration algorithm. We first show that if
a permutation $f$ is live and self-consistent, the f-strategies are optimal (Lemma 3.7). We
need two preliminary propositions. First, if $f$ is self-consistent and Max plays according
to $\sigma_f$, the sequence $(\varphi_f(V_i))_{i \in \mathbb{N}}$ is a submartingale², and symmetrically if $f$ is self-consistent
and Min plays according to $\tau_f$ the sequence $(\varphi_f(V_i))_{i \in \mathbb{N}}$ is a supermartingale.

Proposition 3.5. Let $f$ be a self-consistent permutation. Then, for any strategies $\sigma$ and $\tau$
for Max and Min, vertex $v$, and integer $i$,

$$\mathbb{E}_v^{\sigma_f,\tau}[\varphi_f(V_{i+1}) \mid V_0 \cdots V_i] \geq \varphi_f(V_i) , \tag{3.1}$$

$$\mathbb{E}_v^{\sigma,\tau_f}[\varphi_f(V_{i+1}) \mid V_0 \cdots V_i] \leq \varphi_f(V_i) . \tag{3.2}$$

¹ It would be enough in stopping games, but testing liveness is cheaper than the reduction.

² We do not use any result about martingales in this paper.

Proof. In order to prove (3.1), it is enough to show that for any finite play $h = h_0 \cdots h_i$,

$$\mathbb{E}_v^{\sigma_f,\tau}[\varphi_f(V_{i+1}) \mid V_0 \cdots V_i = h_0 \ldots h_i] \geq \varphi_f(h_i) . \tag{3.3}$$

Depending on the owner of $h_i$, (3.3) follows from one of the following properties of $\varphi_f$:

$$\forall v \in V_{\mathrm{Max}}, \ \varphi_f(v) = \varphi_f(\sigma_f(v)) , \tag{3.4}$$

$$\forall v \in V_{\mathrm{Min}}, \ \forall (v, w) \in E, \ \varphi_f(v) \leq \varphi_f(w) , \tag{3.5}$$

$$\forall v \in V_R, \ \varphi_f(v) = \sum_{w \in V} \delta(v)(w) \cdot \varphi_f(w) . \tag{3.6}$$

Equations (3.4) and (3.6) follow from the definition of $\varphi_f$, and (3.5) follows from
the self-consistency of $f$: by definition of the f-regions, if $v \in V_{\mathrm{Min}}$ and $(v, w) \in E$ then
$f(v) \leq f(w)$ (see (2.4)); since $\varphi_f$ is constant on each f-region, self-consistency yields
$\varphi_f(v) \leq \varphi_f(w)$. The proof of (3.2) is similar and we do not detail it. □

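Second, we show a "stopping property" in the case where $f$ is live and Max plays $\sigma_f$.

Proposition 3.6. Let $f$ be a live permutation. Then, for any strategy $\tau$ for Min and initial
vertex $v$,

$$\mathbb{P}_v^{\sigma_f,\tau}(\mathrm{Reach}(©) \vee \mathrm{Reach}(⊗)) = 1 .$$

Proof. By definition of liveness,

$$\alpha = \min_{1 \leq i \leq k} \sum_{j > i} \sum_{w \in W_f[j]} \delta(f_i)(w)$$

is positive. Let $n = |V|$ and $k = |V_R|$; then the definition of $\sigma_f$ yields:

$$\mathbb{P}_v^{\sigma_f,\tau}(V_n = © \mid V_0 \neq ⊗) \geq \alpha^k ,$$

or, since © and ⊗ are sinks:

$$\mathbb{P}_v^{\sigma_f,\tau}(\forall m \leq n, \ V_m \notin \{©, ⊗\}) \leq 1 - \alpha^k .$$

Equation (1.1) yields:

$$\forall i \in \mathbb{N}, \ \mathbb{P}_v^{\sigma_f,\tau}(\forall m \leq i \cdot n, \ V_m \notin \{©, ⊗\}) \leq (1 - \alpha^k)^i ,$$

hence $\mathbb{P}_v^{\sigma_f,\tau}(\forall m \in \mathbb{N}, \ V_m \notin \{©, ⊗\}) = 0$, hence Proposition 3.6. □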

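We can now prove the correctness of the permutation-enumeration algorithm:

Lemma 3.7. Let $f$ be a live and self-consistent permutation. Then $\sigma_f$ is optimal for Max
and $\tau_f$ is optimal for Min.

Proof. We first prove that $\sigma_f$ ensures that a pebble starting from $v$ has probability at least
$\varphi_f(v)$ to reach ©:

$$\mathbb{P}_v^{\sigma_f,\tau}(\mathrm{Reach}(©)) = \mathbb{E}_v^{\sigma_f,\tau}\Big[\lim_{i \in \mathbb{N}} \varphi_f(V_i)\Big] \tag{3.7}$$

$$= \lim_{i \in \mathbb{N}} \mathbb{E}_v^{\sigma_f,\tau}[\varphi_f(V_i)] \tag{3.8}$$

$$\geq \mathbb{E}_v^{\sigma_f,\tau}[\varphi_f(V_0)] = \varphi_f(v) , \tag{3.9}$$

where (3.7) comes from Proposition 3.6, (3.8) is a property of expectations, and (3.9) comes
from Proposition 3.5.

Then, we show that $\tau_f$ ensures that a pebble starting from $v$ has probability at most
$\varphi_f(v)$ to reach ©:

$$\mathbb{P}_v^{\sigma,\tau_f}(\mathrm{Reach}(©)) \leq \mathbb{E}_v^{\sigma,\tau_f}\Big[\liminf_{i \in \mathbb{N}} \varphi_f(V_i)\Big] \tag{3.10}$$

$$\leq \liminf_{i \in \mathbb{N}} \mathbb{E}_v^{\sigma,\tau_f}[\varphi_f(V_i)] \tag{3.11}$$

$$\leq \mathbb{E}_v^{\sigma,\tau_f}[\varphi_f(V_0)] = \varphi_f(v) , \tag{3.12}$$

where (3.10) holds because © is a sink and $\varphi_f(©) = 1$, (3.11) is a property of expectations
(Fatou's lemma), and (3.12) comes from Proposition 3.5.

Thus, for any strategies $\sigma$ and $\tau$ for Max and Min,

$$\mathbb{P}_v^{\sigma,\tau_f}(\mathrm{Reach}(©)) \leq \mathbb{P}_v^{\sigma_f,\tau_f}(\mathrm{Reach}(©)) \leq \mathbb{P}_v^{\sigma_f,\tau}(\mathrm{Reach}(©)) ,$$

which completes the proof of Lemma 3.7. □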



3.3. Termination of the permutation-enumeration algorithm. Now we show the
existence of a live and self-consistent permutation (Lemma 3.10). Our construction is based
on Proposition 3.8 and its correctness on Proposition 3.9.

Proposition 3.8. Let $X \subseteq V$ be a set of vertices including the target vertex © and $Y$ be
$V \setminus \mathrm{DetAtt}(X)$. Then either $Y = \{⊗\}$ or there is a random vertex $v$ in $Y$ such that:

$$\mathrm{val}(v) = \max\{\mathrm{val}(w) \mid w \in Y\} ,$$

$$\exists w \in \mathrm{DetAtt}(X), \ \delta(v)(w) > 0 .$$

Proof. Let $Z$ be the set of vertices with maximal value in $Y$:

$$Z = \{v \in Y \mid \mathrm{val}(v) = \max_{w \in Y} \mathrm{val}(w)\} ,$$

and suppose that:

$$\forall v \in V_R \cap Z, \ \forall w \in \mathrm{DetAtt}(X), \ \delta(v)(w) = 0 .$$

Let $v$ be a vertex in $Z$. As $G$ is normalised, we just need to show that $\mathrm{val}(v) = 0$, i.e. there
is a strategy $\theta$ for Min such that for every strategy $\sigma$ for Max, $\mathbb{P}_v^{\sigma,\theta}(\mathrm{Reach}(©)) = 0$.

By definition of $\mathrm{DetAtt}(X)$, there is a positional strategy $\tau$ for Min such that $\tau(Y) \subseteq Y$,
and it follows from the definition of $Z$ that $\tau(Z) \subseteq Z$. As $Z$ is also closed under random
moves, a pebble starting in $Z$ can only leave $Z$ through a move of Max, which leads to $Y \setminus Z$
as $Y = V \setminus \mathrm{DetAtt}(X)$.

We now define a non-positional strategy $\theta$ in which Min plays according to $\tau$ as long
as the play remains in $Z$ and switches definitively to an optimal strategy the first time the
pebble moves out of $Z$. We can thus partition the plays starting in $v$ and consistent with
$\sigma$ and $\theta$ depending on if and where the play gets out of $Z$: $T_Z$ is the set of plays remaining
forever in $Z$, and for each $w$ in $Y \setminus Z$, $T_w$ is the set of plays where $w$ is the first visited vertex
outside of $Z$. Clearly $\mathbb{P}_v^{\sigma,\theta}(\mathrm{Reach}(©) \mid T_Z) = 0$ and by definition of the strategy $\theta$,
$\forall w \in Y \setminus Z$, $\mathbb{P}_v^{\sigma,\theta}(\mathrm{Reach}(©) \mid T_w) \leq \mathrm{val}(w)$. Hence
$\mathbb{P}_v^{\sigma,\theta}(\mathrm{Reach}(©)) \leq \max(0, \max_{w \in Y \setminus Z} \mathrm{val}(w))$
and since this holds for every $\sigma$, $\mathrm{val}(v) \leq \max(0, \max_{w \in Y \setminus Z} \mathrm{val}(w))$. By definition of $Z$ this
implies $\mathrm{val}(v) = 0$. □
Proposition 3.9. Let $f$ be a live permutation such that:

$$\mathrm{val}(f_1) \leq \mathrm{val}(f_2) \leq \ldots \leq \mathrm{val}(f_k) .$$

Then $f$ is self-consistent.

Note that under the same hypotheses, Lemma 3.7 implies that the f-strategies are optimal.

Proof. We first show that:

$$\forall v \in V, \ \forall 1 \leq i \leq k, \ (f(v) = i) \Rightarrow (\mathrm{val}(v) = \mathrm{val}(f_i)) . \tag{3.13}$$

Consider the strategy $\sigma^*$, which mimics $\sigma_f$ until the first time the pebble reaches a random
vertex and then switches definitively to an optimal strategy. By definition of $\sigma_f$, the first
random vertex belongs to $\{f_i, \ldots, f_k, ©\}$, so $\sigma^*$ ensures that a pebble starting in $v$ reaches
© with probability at least $\min\{\mathrm{val}(f_i), \ldots, \mathrm{val}(f_k), \mathrm{val}(©)\} = \mathrm{val}(f_i)$. A similar strategy
$\tau^*$ for Min ensures that this probability is at most $\mathrm{val}(f_i)$. So $\mathrm{val}(v) = \mathrm{val}(f_i)$, and (3.13)
follows.

Now we prove that $\mathrm{val}$ and $\varphi_f$ coincide. According to (3.13) and by definition of
permutation strategies,

$$\forall v \in V_{\mathrm{Max}}, \ \mathrm{val}(v) = \mathrm{val}(\sigma_f(v)) ,$$

$$\forall v \in V_{\mathrm{Min}}, \ \mathrm{val}(v) = \mathrm{val}(\tau_f(v)) ,$$

$$\forall v \in V_R, \ \mathrm{val}(v) = \sum_{w \in V} \delta(v)(w) \cdot \mathrm{val}(w) .$$

So, if Max and Min play according to their f-strategies, the sequence $(\mathrm{val}(V_i))_{i \in \mathbb{N}}$ is a
martingale:

$$\mathbb{E}_v^{\sigma_f,\tau_f}[\mathrm{val}(V_{i+1}) \mid V_0 \cdots V_i] = \mathrm{val}(V_i) . \tag{3.14}$$

Consequently, for every vertex $v$, $\varphi_f(v) = \mathrm{val}(v)$:

$$\varphi_f(v) = \mathbb{P}_v^{\sigma_f,\tau_f}(\mathrm{Reach}(©)) = \mathbb{E}_v^{\sigma_f,\tau_f}\Big[\lim_{i \in \mathbb{N}} \mathrm{val}(V_i)\Big] \tag{3.15}$$

$$= \lim_{i \in \mathbb{N}} \mathbb{E}_v^{\sigma_f,\tau_f}[\mathrm{val}(V_i)] \tag{3.16}$$

$$= \mathbb{E}_v^{\sigma_f,\tau_f}[\mathrm{val}(V_0)] = \mathrm{val}(v) , \tag{3.17}$$

where (3.15) comes from Proposition 3.6, (3.16) is a property of expectations, and (3.17)
comes from (3.14). Since $\mathrm{val}$ and $\varphi_f$ coincide, the hypothesis yields the self-consistency of
$f$. This completes the proof of Proposition 3.9. □

Lemma 3.10. There exists a live and self-consistent permutation.

Proof. We use Proposition 3.8 iteratively in order to build a permutation $f$ such that, for
every $k \geq i \geq 1$,

• $\mathrm{val}(f_i) = \max\{\mathrm{val}(v) \mid v \in V \setminus \mathrm{DetAtt}(\{f_{i+1}, f_{i+2}, \ldots, f_k, ©\})\}$;

• $\exists w \in \mathrm{DetAtt}(\{f_{i+1}, f_{i+2}, \ldots, f_k, ©\}), \ \delta(f_i)(w) > 0$.

By construction $f$ is live and $\mathrm{val}(f_1) \leq \mathrm{val}(f_2) \leq \ldots \leq \mathrm{val}(f_k)$. Proposition 3.9 yields the
self-consistency of $f$, and Lemma 3.10 follows. □
4. The permutation-improvement algorithm

A drawback of the permutation-enumeration algorithm is that it considers each and
every possible permutation of the random vertices, so $|V_R|!$ is a lower bound for the worst-case
complexity of this algorithm. Strategy-improvement algorithms avoid such enumerations;
instead, they proceed by successive improvements of a strategy: information
about the sub-optimality of a strategy is used to determine a "better" strategy, which
ensures convergence to an optimal strategy. In this section, we emulate this idea with a
permutation-improvement algorithm.

4.1. A natural but incorrect improvement policy. Starting from an initial
permutation $f$, we would like to improve $f$ again and again until the permutation strategies $\sigma_f$
and $\tau_f$ are optimal. To test optimality we check that $f$ is live and self-consistent (see
Lemma 3.7). When $f$ is live but not self-consistent we compute a new permutation $g$ which
is live and "better" than $f$. A natural improvement policy consists in choosing $g$ consistent
with the f-values, i.e. $g$ refines the pre-order induced by $\varphi_f$. Unfortunately this is too
naive: the corresponding algorithm does not always terminate³. A counter-example is given
by Figure 6.




Figure 6: A counter-example for the naive improvement algorithm. 

If we start with the permutation $f = acb$, the f-strategies are as follows: in ○ Max goes
to b and in □ Min goes to c. Hence, the f-values of vertices a, c, and b are respectively .82,
.9, and .5, so $f$ is not self-consistent. The next permutation is $g = bac$, and the following
g-strategies ensue: in ○ Max goes to a and in □ Min goes to ○. The g-values of vertices
b, a, and c are respectively .5, .1, and .18, so $g$ is not self-consistent either. Moreover, the
next permutation is $f = acb$, so the naive algorithm oscillates endlessly between $f$ and $g$,
never reaching the correct permutation $abc$.
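
The oscillation can be replayed from the numbers quoted above alone. In the Python sketch below, the function `naive_step` is our own illustrative rendering of the naive policy (sort the random vertices by increasing value); the two dictionaries simply copy the f-values and g-values given in the text, without recomputing them from the game of Figure 6.

```python
def naive_step(values):
    """Next permutation under the naive policy: vertices by increasing value."""
    return "".join(sorted(values, key=values.get))

# Values quoted in the text for f = acb and for g = bac.
f_values = {"a": 0.82, "c": 0.9, "b": 0.5}
g_values = {"b": 0.5, "a": 0.1, "c": 0.18}

after_f = naive_step(f_values)  # permutation chosen after evaluating f = acb
after_g = naive_step(g_values)  # permutation chosen after evaluating g = bac
```

Sorting the f-values gives bac, and sorting the g-values gives acb again, which is exactly the two-cycle described above.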

4.2. A correct improvement policy. The correct permutation-improvement policy is a
bit more complex: given a live but not self-consistent permutation $f$, we choose a
permutation $g$ which is live and self-consistent in the one-player game $G[\sigma_f]$, where vertices of
player Max have only one outgoing edge: the edge consistent with the positional f-strategy
$\sigma_f$. This improvement policy guarantees that the value of $\sigma_g$ is greater than the value of $\sigma_f$
(see Lemma 4.4) and is implemented by the following algorithm.

³ Actually, the naive algorithm terminates (and is correct) in the special case of one-player games [Hor08].
Input: A normalised simple stochastic game $G = (V, V_{\mathrm{Max}}, V_{\mathrm{Min}}, V_R, E, \delta, ©, ⊗)$.
Output: Optimal strategies for Max and Min.

1  Pick a live permutation f;
2  repeat
3      if f is self-consistent in G then
4          return $\sigma_f$ and $\tau_f$;
5      else
6          replace f with a live and self-consistent permutation in $G[\sigma_f]$;

Algorithm 2: The permutation-improvement algorithm.
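
The loop structure of Algorithm 2 can be sketched as follows. This Python skeleton is illustrative only: `is_self_consistent` and `improve` are hypothetical callbacks standing for the self-consistency test in $G$ and for the computation of a live and self-consistent permutation in the one-player game, and `max_steps` is a safety bound that would typically be set to $k!$.

```python
def permutation_improvement(initial, is_self_consistent, improve, max_steps):
    """Iterate the improvement step until a self-consistent permutation is found.

    initial is assumed to be live; improve(perm) is assumed to return a
    permutation that is live and self-consistent in the one-player game.
    """
    perm = initial
    steps = 0
    while not is_self_consistent(perm):
        if steps == max_steps:
            raise RuntimeError("no self-consistent permutation within max_steps")
        perm = improve(perm)
        steps += 1
    return perm, steps

# Stub run: a hypothetical improvement chain cba -> bca -> abc.
chain = {"cba": "bca", "bca": "abc"}
outcome = permutation_improvement("cba", lambda p: p == "abc", chain.__getitem__, 6)
```

The step counter makes the number of improvement steps observable, which is the quantity for which no non-trivial lower bound is known.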



The computation of a live and self-consistent permutation in $G[\sigma_f]$ in line 6 relies on
the computation of the values of the one-player game $G[\sigma_f]$. Details are given in the proof of
the following theorem.

Theorem 4.1. The permutation-improvement algorithm terminates and returns optimal
strategies for Max and Min in at most $|V_R|!$ improvement steps. Furthermore, each
improvement step can be carried out in polynomial time.

Proof. According to Lemma 4.2, the algorithm returns a permutation which is both live and
self-consistent in $G$; hence, according to Lemma 3.7, the corresponding permutation strategies
are optimal in $G$, which proves the correctness of the algorithm.

Termination and the maximal number of iterations follow from Lemma 4.4, which
proves that successive strategies $\sigma_f$ have better and better values.

The computation of a live and self-consistent permutation in $G[\sigma_f]$ in line 6 is achieved in
polynomial time in the following way. First, compute the values of the one-player game $G[\sigma_f]$
using linear programming [HK79, Con93]. Second, build in linear time a live permutation
$g$ consistent with these values, as in the proof of Lemma 3.10. The permutation $g$ is such
that $\mathrm{val}_{\sigma_f}(g_1) \leq \mathrm{val}_{\sigma_f}(g_2) \leq \ldots \leq \mathrm{val}_{\sigma_f}(g_k)$, where $\mathrm{val}_{\sigma_f}$ denotes the values in the game
$G[\sigma_f]$. According to Proposition 4.3 the game $G[\sigma_f]$ is normalised, hence Proposition 3.9
guarantees that $g$ is self-consistent in $G[\sigma_f]$. □

Let us briefly compare the permutation-enumeration and the permutation-improvement
algorithms. Each improvement step of the permutation-improvement algorithm requires the
computation of the values of a one-player SSG, which can be performed using linear
programming. These values could be computed as well using a permutation-improvement policy or
a strategy-improvement algorithm in order to avoid linear programming altogether. Either
way, we have to forfeit one of the advantages of the permutation-enumeration algorithm:
the computational simplicity of its inner loop. On the other hand, we do not know any
non-trivial lower bound on the number of loops in a run of the permutation-improvement
algorithm: it may be polynomial.

4.3. Soundness and correctness of the permutation-improvement algorithm. The
correctness proof is based on the following two results.

Lemma 4.2. Let $\sigma$ be a positional strategy for Max and f be a permutation. If f is live in
$G[\sigma]$ it is also live in G.
Proof. Let $W_f$ and $X_f$ denote the f-regions in G and $G[\sigma]$, respectively. By definition,
$\bigcup_{j \ge i} W_f[j]$ is the deterministic attractor of Max to $\{f_i, \dots, f_k, \smiley\}$ in G, while $\bigcup_{j \ge i} X_f[j]$ is
the same attractor in $G[\sigma]$. As the moves of Max are restricted in $G[\sigma]$, we get

    $\forall\, 1 \le i \le k,\quad \bigcup_{j \ge i} X_f[j] \subseteq \bigcup_{j \ge i} W_f[j]$ .    (4.1)

Thus, the liveness of f in G follows from its liveness in $G[\sigma]$, and Lemma 4.2 ensues. □
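The inclusion between attractors used in this proof can be checked on small examples with a standard backward fixpoint computation. The Python sketch below is illustrative only: the graph encoding (a dictionary `succ` of successor lists plus the sets `max_vertices` and `min_vertices`) and the function name are assumptions, and the deterministic attractor is taken here to mean that Max forces the pebble into the target set while moving only through Max and Min vertices, so random vertices outside the target set are never added.

```python
from typing import Dict, Hashable, Iterable, List, Set

Vertex = Hashable


def deterministic_attractor(
    target: Iterable[Vertex],
    succ: Dict[Vertex, List[Vertex]],
    max_vertices: Set[Vertex],
    min_vertices: Set[Vertex],
) -> Set[Vertex]:
    """Vertices from which Max can force a visit to `target`
    without crossing a random vertex outside `target`."""
    attr = set(target)
    changed = True
    while changed:
        changed = False
        for v in max_vertices | min_vertices:
            if v in attr:
                continue
            nxt = succ.get(v, [])
            if v in max_vertices:
                # Max needs one edge into the attractor.
                ok = any(w in attr for w in nxt)
            else:
                # Min must have no way to escape the attractor.
                ok = bool(nxt) and all(w in attr for w in nxt)
            if ok:
                attr.add(v)
                changed = True
    return attr


# Toy game: Max vertex "a" may move to random vertex "r2" or to Min vertex "b";
# Min vertex "b" moves to random vertex "r1".
succ = {"a": ["r2", "b"], "b": ["r1"]}
MAX, MIN = {"a"}, {"b"}
# A positional strategy for Max that always plays a -> b yields this restricted game.
succ_sigma = {"a": ["b"], "b": ["r1"]}

in_G = deterministic_attractor({"r2", "goal"}, succ, MAX, MIN)
in_G_sigma = deterministic_attractor({"r2", "goal"}, succ_sigma, MAX, MIN)
```

In this toy game the attractor computed in the restricted game is a subset of the one computed in the full game, as the proof predicts, because removing edges of Max can only remove ways of entering the target set.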

Proposition 4.3. Let f be a live permutation. Then $G[\sigma_f]$ is normalised.

Proof. In the proof of Proposition 3.6 we have shown the existence of a positive real number
$\alpha$ such that for any strategy $\tau$ for Min and vertex $v \ne \frownie$, $\mathbb{P}_v^{\sigma_f,\tau}(V_n = \smiley) \ge \alpha^k$, hence only $\frownie$
has value 0 in $G[\sigma_f]$. Clearly only $\smiley$ has value 1 in $G[\sigma_f]$, hence Proposition 4.3 follows. □

4.4. Termination of the permutation-improvement algorithm. The value of a strategy
$\sigma$ is denoted $\mathrm{val}_\sigma$ and defined by:

    $\forall v \in V,\quad \mathrm{val}_\sigma(v) = \inf_\tau \mathbb{P}_v^{\sigma,\tau}(\mathrm{Reach}(\smiley))$ .
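For intuition, the value of a fixed positional strategy of Max can be approximated numerically, since once Max's moves are fixed only Min makes choices. The Python sketch below uses plain value iteration from below, a swapped-in technique and not the linear programming used in the proof of Theorem 4.1. The encoding (`choice_succ`, `rand_succ`, the vertex names and the iteration count) is hypothetical, and the result is only an approximation of the infimum.

```python
from typing import Dict, Hashable, List, Tuple

Vertex = Hashable


def approximate_strategy_value(
    choice_succ: Dict[Vertex, List[Vertex]],
    rand_succ: Dict[Vertex, List[Tuple[Vertex, float]]],
    target: Vertex,
    iterations: int = 1000,
) -> Dict[Vertex, float]:
    """Approximate the infimum over Min strategies of P(Reach(target)).

    choice_succ: successors of Min vertices, and of Max vertices restricted
                 to the single successor chosen by the fixed strategy.
    rand_succ:   for each random vertex, a list of (successor, probability).
    Vertices without successors (other than `target`) keep value 0.
    """
    vertices = {target} | set(choice_succ) | set(rand_succ)
    for ws in choice_succ.values():
        vertices.update(ws)
    for pairs in rand_succ.values():
        vertices.update(w for w, _ in pairs)

    val = {v: 0.0 for v in vertices}
    val[target] = 1.0
    for _ in range(iterations):
        new = dict(val)
        for v, ws in choice_succ.items():
            if ws:
                new[v] = min(val[w] for w in ws)       # Min picks the worst successor
        for v, pairs in rand_succ.items():
            new[v] = sum(p * val[w] for w, p in pairs)  # expectation at random vertices
        new[target] = 1.0
        val = new
    return val


# Toy game: Max vertex "a" is fixed to move to Min vertex "m";
# "m" chooses between random vertex "r" and the target "goal";
# "r" reaches "goal" or a losing sink "sink" with probability 1/2 each.
game_choice = {"a": ["m"], "m": ["r", "goal"]}
game_rand = {"r": [("goal", 0.5), ("sink", 0.5)]}
values = approximate_strategy_value(game_choice, game_rand, "goal")
```

In this toy game Min prefers to move to "r", so the approximated values of "a", "m" and "r" all equal 0.5.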

For proving termination of the permutation-improvement algorithm we prove that successive
strategies $\sigma_f$ chosen by the algorithm have greater and greater values.

Lemma 4.4. Let f be a live permutation in G and g be a live and self-consistent permutation
in $G[\sigma_f]$. Then for all $v \in V$,

    $\mathrm{val}_{\sigma_f}(v) \le \mathrm{val}_{\sigma_g}(v)$ .    (4.2)

Moreover, if for all $v \in V$, $\mathrm{val}_{\sigma_g}(v) = \mathrm{val}_{\sigma_f}(v)$, then g is self-consistent in G.

Proof. A key remark in the proof of Lemma 4.4 is that:

    $\mathrm{val}_{\sigma_f}(g_1) \le \mathrm{val}_{\sigma_f}(g_2) \le \dots \le \mathrm{val}_{\sigma_f}(g_k)$ .    (4.3)

Let $\psi_{f,g}$ be the g-values in $G[\sigma_f]$. The self-consistency of g in $G[\sigma_f]$ is:

    $\psi_{f,g}(g_1) \le \psi_{f,g}(g_2) \le \dots \le \psi_{f,g}(g_k)$ .

Lemma 3.7 applied to $G[\sigma_f]$ implies that the g-strategy of player Min in $G[\sigma_f]$ is optimal
in $G[\sigma_f]$, hence $\psi_{f,g} = \mathrm{val}_{\sigma_f}$ and (4.3) follows.

Consider now the sequence $(\sigma^n)_{n \in \mathbb{N}}$, where $\sigma^n$ is the strategy where Max plays according
to $\sigma_g$ until the pebble has visited n random vertices, and plays according to $\sigma_f$ afterwards.
In particular $\sigma^0 = \sigma_f$. We show that for every vertex v the sequence $(\mathrm{val}_{\sigma^n}(v))_{n \in \mathbb{N}}$ is
non-decreasing and that its limit is at most $\mathrm{val}_{\sigma_g}(v)$. Since $\sigma^0 = \sigma_f$ this will prove (4.2).

We first show by induction that for any integer n,

    $\forall v \in V,\quad \mathrm{val}_{\sigma^{n+1}}(v) \ge \mathrm{val}_{\sigma^n}(v)$ .    (4.4)

Basis (n = 0): We have to prove that the values of $\sigma^1$ are at least the values of $\sigma_f$. Let v be
a vertex, i be the index of the g-region of v in G, and j be the index of the g-region of v
in $G[\sigma_f]$. As the moves of Max are restricted in $G[\sigma_f]$, the definition of the g-regions gives
$i \ge j$ and (4.3) yields:

    $\mathrm{val}_{\sigma_f}(g_i) \ge \mathrm{val}_{\sigma_f}(g_j)$ .    (4.5)

If Max plays with $\sigma^1$, the definition of $\sigma_g$ ensures that the first random vertex visited belongs
to $\{g_i, g_{i+1}, \dots, g_k, \smiley\}$, so $\mathrm{val}_{\sigma^1}(v) \ge \min\{\mathrm{val}_{\sigma_f}(g_i), \mathrm{val}_{\sigma_f}(g_{i+1}), \dots, \mathrm{val}_{\sigma_f}(g_k), 1\}$ and (4.3)
yields:

    $\mathrm{val}_{\sigma^1}(v) \ge \mathrm{val}_{\sigma_f}(g_i)$ .    (4.6)

On the other hand we prove:

    $\mathrm{val}_{\sigma_f}(v) = \mathrm{val}_{\sigma_f}(g_j)$ .    (4.7)

Let $\psi_{f,g}$ denote the g-values in $G[\sigma_f]$. We have already proved above that $\psi_{f,g}$ is equal to
$\mathrm{val}_{\sigma_f}$. By definition of j, $g_j$ is the first random vertex in a play in $G[\sigma_f]$ starting from v and
consistent with a g-strategy for Min in $G[\sigma_f]$, hence $\psi_{f,g}(v) = \psi_{f,g}(g_j)$, which yields (4.7).

It follows from (4.5), (4.6), and (4.7) that (4.4) holds for n = 0.

Inductive step (n → n + 1): The strategies $\sigma^{n+2}$ and $\sigma^{n+1}$ coincide with $\sigma_g$ until the
first visit to a random vertex. Then $\sigma^{n+2}$ switches to $\sigma^{n+1}$ while $\sigma^{n+1}$ switches to $\sigma^n$. By
induction hypothesis, $\mathrm{val}_{\sigma^{n+1}} \ge \mathrm{val}_{\sigma^n}$, so $\mathrm{val}_{\sigma^{n+2}} \ge \mathrm{val}_{\sigma^{n+1}}$ and (4.4) holds for n + 1.

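Now we show that for every v, $\lim_n \mathrm{val}_{\sigma^n}(v) \le \mathrm{val}_{\sigma_g}(v)$. Let $\tau$ be a strategy for Min.
We have:

    $\mathbb{P}_v^{\sigma_g,\tau}(\mathrm{Reach}(\smiley)) = \mathbb{P}_v^{\sigma_g,\tau}(\neg\mathrm{Reach}(\frownie))$ ,    (4.8)
    $= \lim_n \mathbb{P}_v^{\sigma_g,\tau}(\forall m \le n,\ V_m \ne \frownie)$ ,
    $= \lim_n \mathbb{P}_v^{\sigma^n,\tau}(\forall m \le n,\ V_m \ne \frownie)$ ,    (4.9)
    $\ge \lim_n \mathbb{P}_v^{\sigma^n,\tau}(\mathrm{Reach}(\smiley)) \ge \lim_n \mathrm{val}_{\sigma^n}(v)$ ,    (4.10)

where (4.8) follows from Proposition 3.6, (4.9) holds because $\sigma_g$ coincides with $\sigma^n$ for at
least n steps, and (4.10) holds by event inclusion and by definition of the value. This holds for
every strategy $\tau$, hence $\mathrm{val}_{\sigma_g}(v) \ge \lim_n \mathrm{val}_{\sigma^n}(v)$.

Altogether, $\mathrm{val}_{\sigma_f}(v) = \mathrm{val}_{\sigma^0}(v) \le \mathrm{val}_{\sigma^1}(v) \le \dots \le \lim_n \mathrm{val}_{\sigma^n}(v) \le \mathrm{val}_{\sigma_g}(v)$, hence (4.2),
which completes the proof of the first part of the lemma.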

Let us suppose now that $\mathrm{val}_{\sigma_f} = \mathrm{val}_{\sigma_g}$. Equation (4.3) yields:

    $\mathrm{val}_{\sigma_g}(g_1) \le \mathrm{val}_{\sigma_g}(g_2) \le \dots \le \mathrm{val}_{\sigma_g}(g_k)$ .    (4.11)

We can thus apply Proposition 3.9 to g in $G[\sigma_g]$, which yields the self-consistency of g in
$G[\sigma_g]$. By definition of the g-regions, they coincide in G and $G[\sigma_g]$, hence the g-values are equal
in G and $G[\sigma_g]$ and g is also self-consistent in G. □



Conclusion 

We have presented two algorithms computing optimal strategies in simple stochas- 
tic games: the permutation-enumeration and the permutation-improvement algorithms. 
Both of them rely on the existence of optimal permutation strategies. The permutation- 
enumeration algorithm simply tests every permutation until it finds a live and self-consistent 
one. The permutation-improvement algorithm uses a smarter policy in order to choose a 
"better" permutation in the next round, a la Hoffman-Karp. 

The permutation-enumeration algorithm has exponential worst-case complexity but it is 
a witness that solving SSGs is fixed-parameter tractable when the parameter is the number 
of random vertices. The nominal complexity of the permutation-improvement algorithm is 
a bit higher but we do not know any non-trivial lower bound on the number of improvement 
steps: the permutation-improvement algorithm may actually run in polynomial time. 
Whether simple stochastic games are solvable in polynomial time remains a challenging
open question.

Acknowledgements. We would like to thank Marcin Jurdziński for some fruitful discussions,
the anonymous reviewers for several useful suggestions and Julien Cristau for his
invaluable comments during the writing of the final version.



References 

[AHMS08] Daniel Andersson, Kristoffer Arnsfelt Hansen, Peter Bro Miltersen, and Troels Bjerre Sørensen. Deterministic Graphical Games Revisited. In Proceedings of CiE'08, volume 5028 of LNCS, pages 1-10. Springer-Verlag, 2008.

[Bil95] Patrick Billingsley. Probability and Measure. John Wiley & Sons, 1995.

[Con92] Anne Condon. The Complexity of Stochastic Games. Information and Computation, 96(2):203-224, 1992.

[Con93] Anne Condon. On Algorithms for Simple Stochastic Games. In Advances in Computational Complexity Theory, volume 13 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages 51-73. American Mathematical Society, 1993.

[Der72] Cyrus Derman. Finite State Markovian Decision Processes. Academic Press, 1972.

[Dix82] John D. Dixon. Exact Solution of Linear Equations Using p-adic Expansions. Numerische Mathematik, 40:137-141, 1982.

[Gil57] Dean Gillette. Stochastic Games with Zero Stop Probability, volume 3 of Contributions to the Theory of Games, pages 179-187. Princeton University Press, 1957.

[Hal07] Nir Halman. Simple Stochastic Games, Parity Games, Mean Payoff Games and Discounted Payoff Games are all LP-Type Problems. Algorithmica, 49:37-50, 2007.

[HK66] Alan J. Hoffman and Richard M. Karp. On Nonterminating Stochastic Games. Management Science, 12(5):359-370, 1966.

[HK79] Arie Hordijk and L.C.M. Kallenberg. Linear Programming and Markov Decision Chains. Management Science, 25(4):353-362, 1979.

[Hor08] Florian Horn. Random Games. PhD thesis, Université Paris 7 and RWTH Aachen, 2008.

[Kha79] Leonid G. Khachiyan. A Polynomial Algorithm in Linear Programming. Soviet Mathematics Doklady, 20:191-194, 1979.

[LL69] Thomas M. Liggett and Steven A. Lippman. Stochastic Games with Perfect Information and Time Average Payoff. SIAM Review, 11(4):604-607, 1969.

[Lud95] Walter Ludwig. A Subexponential Randomized Algorithm for the Simple Stochastic Game Problem. Information and Computation, 117(1):151-155, 1995.

[Ren88] James Renegar. A Polynomial-Time Algorithm, Based on Newton's Method, for Linear Programming. Mathematical Programming, 40(1):59-93, 1988.

[Sha53] Lloyd S. Shapley. Stochastic Games. In Proceedings of the National Academy of Science of the USA, volume 39, pages 1095-1100, 1953.

[Som05] Rafal Somla. New Algorithms for Solving Simple Stochastic Games. 119(1):51-65, 2005.



This work is licensed under the Creative Commons Attribution-NoDerivs License. To view
a copy of this license, visit http://creativecommons.org/licenses/by-nd/2.0/ or send a
letter to Creative Commons, 171 Second St, Suite 300, San Francisco, CA 94105, USA, or
Eisenacher Strasse 2, 10777 Berlin, Germany



